Computer vision applications have long relied on the linear combination of a Lambertian diffuse and a microfacet specular reflection model to represent reflected radiance, a combination that turns out to be physically incompatible and limited in applicability. In this paper, we derive a novel analytical reflectance model, which we refer to as the Fresnel Microfacet BRDF (FMBRDF) model, that is physically accurate and generalizes to various real-world surfaces. Our key idea is to model the Fresnel reflection and transmission of the surface microgeometry with a collection of oriented mirror facets, for both body and surface reflections. We carefully derive the Fresnel reflection and transmission for each microfacet as well as the light transport between them in the subsurface. This physically-grounded modeling also allows us to express the polarimetric behavior of reflected light in addition to its radiometric behavior. That is, FMBRDF unifies not only body and surface reflections but also radiometric and polarimetric light reflection, and represents them in a single model. Experimental results demonstrate its effectiveness in accuracy, expressive power, and image-based estimation.
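For reference, the conventional combination that the paper argues against is typically written as a weighted sum of a Lambertian body term and a Cook-Torrance-style microfacet surface term; a standard form (notation assumed here, not taken from the paper) is:

```latex
% Conventional diffuse + microfacet specular combination (standard form,
% not the paper's FMBRDF): albedo \rho, half vector \mathbf{h}, microfacet
% distribution D, shadowing-masking G, Fresnel reflectance F.
f_r(\mathbf{l}, \mathbf{v}) =
  k_d \, \frac{\rho}{\pi}
  + k_s \, \frac{D(\mathbf{h})\, G(\mathbf{l}, \mathbf{v})\, F(\mathbf{l}, \mathbf{h})}
               {4\, (\mathbf{n} \cdot \mathbf{l})\, (\mathbf{n} \cdot \mathbf{v})}
```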
We introduce a novel multi-view stereo (MVS) method that can simultaneously recover not just per-pixel depth but also surface normals, together with the reflectance of textureless, complex non-Lambertian surfaces captured under known but natural illumination. Our key idea is to formulate MVS as an end-to-end learnable network, which we refer to as NLMVS-NET, that seamlessly integrates radiometric cues to leverage surface normals as view-independent surface features for learned cost volume construction and filtering. It first estimates surface normals as pixel-wise probability densities for each view with a novel shape-from-shading network. These per-pixel surface normal densities and the input multi-view images are then input to a novel cost volume filtering network that learns to recover per-pixel depth and surface normals. The reflectance is also estimated by alternating with the geometry reconstruction. Extensive quantitative evaluations on newly established synthetic and real-world datasets show that NLMVS-NET can robustly and accurately recover the shape and reflectance of complex objects in natural settings.
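A minimal structural sketch of the two-stage pipeline described above, with simple stand-in modules (module definitions, shapes, and names here are illustrative assumptions, not the authors' implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShapeFromShadingNet(nn.Module):
    """Stand-in: maps an RGB view to per-pixel surface-normal estimates
    (a full density model would predict distribution parameters instead)."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 3, 3, padding=1)

    def forward(self, image):                        # (B, 3, H, W)
        return F.normalize(self.conv(image), dim=1)  # unit normals

class CostVolumeFilterNet(nn.Module):
    """Stand-in: fuses multi-view images and per-view normal estimates
    into per-pixel depth and refined normals."""
    def __init__(self, n_views):
        super().__init__()
        self.fuse = nn.Conv2d(n_views * 6, 4, 3, padding=1)

    def forward(self, images, normals):              # lists of (B, 3, H, W)
        out = self.fuse(torch.cat(images + normals, dim=1))
        return out[:, :1], F.normalize(out[:, 1:], dim=1)  # depth, normal

views = [torch.rand(1, 3, 64, 64) for _ in range(3)]
sfs = ShapeFromShadingNet()
normals = [sfs(v) for v in views]
depth, normal = CostVolumeFilterNet(n_views=3)(views, normals)
```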
We introduce 2D blind spot estimation as a critical visual task for road scene understanding. By automatically detecting road regions occluded from the vehicle's vantage point, we can proactively alert a human driver or an autonomous driving system to potential causes of accidents (e.g., draw attention to a road region from which a child might run out). Detecting blind spots in full 3D would be challenging, as 3D reasoning would be prohibitively expensive and error-prone even if the car is equipped with LiDAR. Instead, we propose to learn to estimate blind spots in 2D from a monocular camera. We achieve this in two steps. We first introduce an automatic method for generating "ground-truth" blind spot training data for arbitrary driving videos by leveraging monocular depth estimation, semantic segmentation, and SLAM. The key idea is to reason in 3D but to define blind spots from 2D images as those road regions that are currently invisible but become visible in the near future. We use this automatic offline blind spot estimation to construct a large-scale dataset, which we call the Road Blind Spot (RBS) dataset. Next, we introduce BlindSpotNet (BSN), a simple network that fully leverages this dataset to fully automatically estimate frame-wise blind spot probability maps for arbitrary driving videos. Extensive experimental results demonstrate the validity of our RBS dataset and the effectiveness of our BSN.
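The core labeling rule reduces to a one-line mask operation; a minimal sketch, assuming the re-projection of future road observations into the current frame (via SLAM poses and monocular depth) has already been done (function and argument names are hypothetical):

```python
import numpy as np

def blind_spot_mask(road_visible_now, road_visible_soon):
    """Blind spot = road that is occluded now but observed in a near-future
    frame and re-projected into the current view. Inputs: HxW boolean masks."""
    return road_visible_soon & ~road_visible_now

now = np.zeros((4, 4), dtype=bool); now[2:, :] = True  # road visible now
soon = np.ones((4, 4), dtype=bool)                     # road seen later
print(blind_spot_mask(now, soon))                      # top rows are flagged
```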
We introduce view birdification, the problem of recovering the ground-plane movements of people in a crowd from an egocentric video captured by an observer (e.g., a person or a vehicle) who is also moving in the crowd. The recovered ground motion would provide a sound basis for situational understanding and benefit downstream applications in computer vision and robotics. In this paper, we formulate view birdification as a geometric trajectory reconstruction problem and derive a cascaded optimization method from a Bayesian perspective. The method first estimates the motion of the observer and then localizes the surrounding pedestrians for each frame, while taking into account the local interactions between them. We introduce three datasets by leveraging synthetic and real trajectories of people in crowds and evaluate the effectiveness of our method on them. The results demonstrate the accuracy of our method and lay the ground for further studies of view birdification as an important but challenging visual understanding problem.
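One way to write the cascaded estimation sketched above, with notation assumed here rather than taken from the paper: let $\Pi_t$ be the observer pose, $\mathcal{X}_t$ the ground-plane pedestrian positions, and $\mathcal{Z}_t$ the on-image observations at frame $t$:

```latex
% Cascade: estimate the observer's motion first, then localize pedestrians
% conditioned on it, with a prior capturing local crowd interactions.
\hat{\Pi}_t = \arg\max_{\Pi_t} \, p(\Pi_t \mid \mathcal{Z}_t, \mathcal{X}_{t-1}), \qquad
\hat{\mathcal{X}}_t = \arg\max_{\mathcal{X}_t} \,
  p(\mathcal{Z}_t \mid \mathcal{X}_t, \hat{\Pi}_t)\,
  p(\mathcal{X}_t \mid \mathcal{X}_{t-1})
```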
Current reinforcement learning algorithms train the agent on forward-generated trajectories. Forward-generated trajectories give the agent little guidance, so the agent can explore as much as possible. While the appeal of reinforcement learning comes from sufficient exploration, this comes at the cost of sample efficiency. Sample efficiency is an important factor that determines the performance of an algorithm. Past work has used reward shaping techniques and changes to the network structure to increase sample efficiency; however, these methods require many steps to implement. In this work, we propose a novel reverse curriculum reinforcement learning method. Reverse curriculum learning trains the agent on the backward trajectory of an episode rather than the original forward trajectory. This gives the agent a strong reward signal, so the agent can learn in a more sample-efficient manner. Moreover, our method requires only a minor change to the algorithm: reversing the order of the trajectory before training the agent. Therefore, it can simply be applied to any state-of-the-art algorithm.
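Because the only change is the order in which collected transitions are replayed, the method reduces to a few lines wrapped around any existing update rule; a minimal sketch (the `agent.update` interface is hypothetical):

```python
def train_on_episode_reversed(agent, episode):
    """Replay an episode backward: transitions near the terminal reward are
    learned first, so the value signal propagates to earlier states sooner."""
    for transition in reversed(episode):  # (state, action, reward, next_state)
        agent.update(*transition)
```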
Neural fields, also known as coordinate-based or implicit neural representations, have shown a remarkable capability for representing, generating, and manipulating various forms of signals. For video representations, however, mapping pixel-wise coordinates to RGB colors has shown relatively low compression performance and slow convergence and inference speed. Frame-wise video representation, which maps a temporal coordinate to its entire frame, has recently emerged as an alternative method to represent videos, improving compression rates and encoding speed. While promising, it has still failed to reach the performance of state-of-the-art video compression algorithms. In this work, we propose FFNeRV, a novel method for incorporating flow information into frame-wise representations to exploit the temporal redundancy across the frames in videos, inspired by standard video codecs. Furthermore, we introduce a fully convolutional architecture, enabled by one-dimensional temporal grids, improving the continuity of spatial features. Experimental results show that FFNeRV yields the best performance for video compression and frame interpolation among the methods using frame-wise representations or neural fields. To reduce the model size even further, we devise a more compact convolutional architecture using group and pointwise convolutions. With model compression techniques, including quantization-aware training and entropy coding, FFNeRV outperforms widely-used standard video codecs (H.264 and HEVC) and performs on par with state-of-the-art video compression algorithms.
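A minimal sketch of the flow-guided idea: reuse pixels from neighboring frames by backward-warping them with predicted flows and blending them with an independently decoded frame (the warping utility is standard; the names, stand-in flows, and blending weights are assumptions, not the FFNeRV architecture):

```python
import torch
import torch.nn.functional as F

def backward_warp(frame, flow):
    """Warp `frame` (N,C,H,W) by per-pixel `flow` (N,2,H,W) in pixel units."""
    n, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys)).float().expand(n, -1, -1, -1)  # (N,2,H,W)
    coords = base + flow
    grid = torch.stack((2 * coords[:, 0] / (w - 1) - 1,  # normalize to [-1,1]
                        2 * coords[:, 1] / (h - 1) - 1), dim=-1)
    return F.grid_sample(frame, grid, align_corners=True)

# Blend two flow-warped neighbors with an independently decoded frame.
f_prev, f_next = torch.rand(1, 3, 32, 32), torch.rand(1, 3, 32, 32)
flow = torch.zeros(1, 2, 32, 32)             # predicted flow (stand-in)
weight = torch.full((1, 3, 32, 32), 1 / 3)   # predicted weights (stand-in)
independent = torch.rand(1, 3, 32, 32)
pred = (weight * backward_warp(f_prev, flow)
        + weight * backward_warp(f_next, flow) + weight * independent)
```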
Task-oriented dialogue (TOD) systems are mainly based on the slot-filling-based TOD (SF-TOD) framework, in which dialogues are broken down into smaller, controllable units (i.e., slots) to fulfill a specific task. A series of approaches based on this framework have achieved remarkable success on various TOD benchmarks. However, we argue that current TOD benchmarks are limited as surrogates for real-world scenarios and that current TOD models are still a long way from handling such scenarios. In this position paper, we first identify the current status and limitations of SF-TOD systems. We then explore the WebTOD framework, an alternative direction for building a scalable TOD system when a web/mobile interface is available. In WebTOD, the dialogue system learns how to understand the web/mobile interface that the human agent interacts with, powered by a large-scale language model.
Neural radiance fields (NeRF) have demonstrated the potential of coordinate-based neural representation (neural fields or implicit neural representation) in neural rendering. However, using a multi-layer perceptron (MLP) to represent a 3D scene or object requires enormous computational resources and time. There have been recent studies on how to reduce these computational inefficiencies by using additional data structures, such as grids or trees. Despite the promising performance, the explicit data structure necessitates a substantial amount of memory. In this work, we present a method to reduce the size without compromising the advantages of having additional data structures. In detail, we propose using the wavelet transform on grid-based neural fields. Grid-based neural fields provide fast convergence, and the wavelet transform, whose efficiency has been demonstrated in high-performance standard codecs, improves the parameter efficiency of grids. Furthermore, in order to achieve a higher sparsity of grid coefficients while maintaining reconstruction quality, we present a novel trainable masking approach. Experimental results demonstrate that non-spatial grid coefficients, such as wavelet coefficients, are capable of attaining a higher level of sparsity than spatial grid coefficients, resulting in a more compact representation. With our proposed mask and compression pipeline, we achieved state-of-the-art performance within a memory budget of 2 MB. Our code is available at https://github.com/daniel03c1/masked_wavelet_nerf.
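A minimal sketch of one common way to realize such a trainable mask, using a straight-through estimator so the hard binary mask still receives gradients (this is a generic construction, assumed here rather than taken from the paper):

```python
import torch

class TrainableMask(torch.nn.Module):
    """Soft mask on grid/wavelet coefficients, binarized in the forward pass
    with a straight-through gradient so pruned coefficients stay trainable."""
    def __init__(self, shape):
        super().__init__()
        self.logits = torch.nn.Parameter(torch.zeros(shape))

    def forward(self, coeffs):
        soft = torch.sigmoid(self.logits)
        hard = (soft > 0.5).float()
        return coeffs * (hard + soft - soft.detach())  # straight-through

grid = torch.nn.Parameter(torch.randn(64, 64))  # wavelet/grid coefficients
masked = TrainableMask(grid.shape)(grid)        # sparse, compressible output
```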
Neural Architecture Search (NAS) for automatically finding the optimal network architecture has shown some success, with competitive performance in various computer vision tasks. However, NAS in general requires a tremendous amount of computation, so reducing computational cost has emerged as an important issue. Most of the attempts so far have been based on manual approaches, and the architectures developed from such efforts often must strike a balance between network optimality and search cost. Additionally, recent NAS methods for image restoration generally do not consider dynamic operations that may transform the dimensions of feature maps, because of the dimensionality mismatch in tensor calculations. This can greatly limit NAS in its search for an optimal network structure. To address these issues, we re-frame the optimal search problem by focusing at the component block level. Previous work has shown that an effective denoising block can be connected in series to further improve network performance. By focusing at the block level, the search space of reinforcement learning becomes significantly smaller and the evaluation process can be conducted more rapidly. In addition, we integrate innovative dimension matching modules for dealing with the spatial and channel-wise mismatches that may occur in the optimal design search, allowing much flexibility in the optimal network search within the cell block. With these modules, we then employ reinforcement learning to search for an optimal image denoising network at a module level. The computational efficiency of our proposed Denoising Prior Neural Architecture Search (DPNAS) was demonstrated by having it complete an optimal architecture search for an image restoration task in just one day with a single GPU.
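A minimal sketch of what a dimension matching module can look like: a 1x1 convolution reconciles channel counts and bilinear resizing reconciles spatial sizes, so arbitrarily shaped candidate blocks can be composed during the search (names and design here are illustrative, not the paper's module):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DimensionMatch(nn.Module):
    """Adapts a feature map to a target channel count and spatial size so
    blocks with mismatched tensor dimensions can still be chained."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x, target_hw):
        x = self.proj(x)                       # channel-wise matching
        if x.shape[-2:] != target_hw:          # spatial matching
            x = F.interpolate(x, size=target_hw, mode="bilinear",
                              align_corners=False)
        return x

x = torch.rand(1, 32, 16, 16)
y = DimensionMatch(32, 64)(x, (32, 32))        # -> (1, 64, 32, 32)
```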
An important challenge in vision-based action recognition is the embedding of spatiotemporal features from two or more heterogeneous modalities into a single feature. In this study, we propose a new 3D deformable transformer for action recognition with adaptive spatiotemporal receptive fields and a cross-modal learning scheme. The 3D deformable transformer consists of three attention modules: 3D deformability, local joint stride, and temporal stride attention. The two cross-modal tokens are input into the 3D deformable attention module to create a cross-attention token with a reflected spatiotemporal correlation. Local joint stride attention is applied to spatially combine attention and pose tokens. Temporal stride attention temporally reduces the number of input tokens in the attention module and supports temporal expression learning without the simultaneous use of all tokens. The deformable transformer iterates L times and combines the last cross-modal token for classification. The proposed 3D deformable transformer was tested on the NTU60, NTU120, FineGYM, and Penn Action datasets and showed results better than or similar to those of pre-trained state-of-the-art methods, even without a pre-training process. In addition, by visualizing the important joints and correlations during action recognition through spatial joint and temporal stride attention, we show the potential for explainable action recognition.
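A minimal sketch of the temporal stride idea in isolation: keys and values are subsampled along time, so attention cost drops from O(T^2) to O(T^2/s) and the module never holds all T tokens as keys/values at once (single-head and unprojected for brevity; an illustration, not the paper's exact module):

```python
import torch

def temporal_stride_attention(tokens, stride):
    """tokens: (B, T, D). Queries attend only to every `stride`-th token."""
    q, kv = tokens, tokens[:, ::stride]                        # (B,T/s,D)
    scores = q @ kv.transpose(1, 2) / tokens.shape[-1] ** 0.5  # (B,T,T/s)
    return torch.softmax(scores, dim=-1) @ kv                  # (B,T,D)

out = temporal_stride_attention(torch.rand(2, 16, 8), stride=4)
```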